This R code defines a function load.package() that checks if a specified package is already installed in the R environment. If the package is not installed, it installs it using the install.packages() function and then loads it into the current session using the library() function.
The function takes a single argument, package, which is a character string specifying the name of the package to be loaded.
The code then calls the load.package() function for each of the required packages for the analysis. These packages include dplyr, tidyr, tm, udpipe, SnowballC, vader, ggplot2, ggthemes, stringr, qdapTools, quanteda, quanteda.textstats, quantmod, and tibble.
By using this function, the code ensures that all the required packages are installed and loaded, without having to manually check each package and install it if necessary.
# Check if a package is installed and load it, otherwise install and load it
load.package <- function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  } else {
    library(package, character.only = TRUE)
  }
}
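As a side note, require() both tests for and attaches a package, so the else branch's library() call is redundant. A slightly leaner variant (a sketch, not part of the original notebook) would be:

```r
# Leaner sketch: require() already attaches the package when it is
# installed, so library() is only needed after a fresh install.
load.package <- function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  }
}
```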
# Load required packages
load.package("dplyr")
load.package("tidyr")
load.package("tm")
load.package("udpipe")
load.package("SnowballC")
load.package("vader")
load.package("ggplot2")
load.package("ggthemes")
load.package("stringr")
load.package("qdapTools")
load.package("quanteda")
load.package("quanteda.textstats")
load.package("quantmod")
load.package("tibble")
(Console output truncated: the usual package startup and masking messages, plus a warning that quanteda.textstats was not yet installed — it is installed on the fly together with its dependencies nsyllable and proxyC — and the xts warning that dplyr's lag() masks base R's lag().)
Data source: https://www.kaggle.com/datasets/vidyapb/elon-musk-tweets-2015-to-2020
tesla = read.csv('/kaggle/input/elon-musk-tweets-2015-to-2020/elonmusk.csv')
head(tesla, 5)
id | conversation_id | created_at | date | time | timezone | user_id | username | name | place | ⋯ | geo | source | user_rt_id | user_rt | retweet_id | reply_to | retweet_date | translate | trans_src | trans_dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <chr> | <chr> | <chr> | <int> | <chr> | <chr> | <lgl> | ⋯ | <lgl> | <lgl> | <lgl> | <lgl> | <lgl> | <chr> | <lgl> | <lgl> | <lgl> | <lgl> | |
1 | 1.282940e+18 | 1.282933e+18 | 1.594712e+12 | 2020-07-14 | 07:28:03 | UTC | 44196397 | elonmusk | Elon Musk | NA | ⋯ | NA | NA | NA | NA | NA | [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '1308211178', 'username': 'Teslarati'}] | NA | NA | NA | NA |
2 | 1.282845e+18 | 1.282802e+18 | 1.594689e+12 | 2020-07-14 | 01:10:26 | UTC | 44196397 | elonmusk | Elon Musk | NA | ⋯ | NA | NA | NA | NA | NA | [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '275295731', 'username': 'davidtayar5'}, {'user_id': '1046173159580479490', 'username': 'TeslaLisa'}, {'user_id': '999632398525530114', 'username': 'SteveHamel16'}, {'user_id': '1689516060', 'username': 'vincent13031925'}, {'user_id': '717042249624854529', 'username': 'S_Padival'}, {'user_id': '980554784133427200', 'username': 'BarkMSmeagol'}, {'user_id': '23771947', 'username': 'annerajb'}, {'user_id': '12153682', 'username': 'PJHORNAK'}, {'user_id': '2638809573', 'username': 'WPipperger'}, {'user_id': '316068880', 'username': 'EcoHeliGuy'}] | NA | NA | NA | NA |
3 | 1.282806e+18 | 1.282759e+18 | 1.594680e+12 | 2020-07-13 | 22:34:13 | UTC | 44196397 | elonmusk | Elon Musk | NA | ⋯ | NA | NA | NA | NA | NA | [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '319128454', 'username': 'katlinegrey'}] | NA | NA | NA | NA |
4 | 1.282800e+18 | 1.282672e+18 | 1.594678e+12 | 2020-07-13 | 22:12:52 | UTC | 44196397 | elonmusk | Elon Musk | NA | ⋯ | NA | NA | NA | NA | NA | [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '1689516060', 'username': 'vincent13031925'}] | NA | NA | NA | NA |
5 | 1.282800e+18 | 1.282739e+18 | 1.594678e+12 | 2020-07-13 | 22:12:26 | UTC | 44196397 | elonmusk | Elon Musk | NA | ⋯ | NA | NA | NA | NA | NA | [{'user_id': '44196397', 'username': 'elonmusk'}, {'user_id': '17217640', 'username': 'SpaceflightNow'}] | NA | NA | NA | NA |
nrow(tesla)
This code uses the dplyr package to perform data manipulation operations on the tesla data frame.
tesla %>% arrange(desc(date)) sorts the rows of the tesla data frame in descending order of date. Note that arrange() takes unquoted column names and has no ascending argument; the original call arrange("date", ascending = F) would sort by the constant string "date" and leave the row order unchanged, so desc(date) is the correct way to request descending order.
%>% group_by(date) groups the rows of the data frame by date, so that subsequent summary operations are computed once per unique date.
%>% summarize(tweet = paste(tweet, collapse = " . ")) collapses the tweet column within each group (i.e., each unique date) into a single string, with individual tweets separated by " . ".
tesla = tesla %>% arrange(desc(date)) %>%
  group_by(date) %>%
  summarize(tweet = paste(tweet, collapse = " . "))
tail(tesla, 5)
date | tweet |
---|---|
<chr> | <chr> |
2020-07-10 | I spoke with Korolev’s family today. He was one of the very best. Корольов / Королёв. . In general, we need to improve how podcasts play . SPQR https://m.youtube.com/watch?v=wjOfQfxmTLQ … . True haha . Death is the loss of information . Earning power post augmentation would easily pay for itself (if that’s even necessary). This is the best thing I can think of to ensure that collective human will decides the future. . Absolutely . 👀 . Probably a good one to design & engineer in Germany . Even more . Ok . 👀 . Berlin Model Y is the one to watch. That is a revolution in automotive body engineering (finally). . True . Wow, IHOP & GitHub are close . Best use of the term “Full Stack”? . For sure. This is both great & terrifying. Everything we’ve ever sensed or thought has been electrical signals. The early universe was just a soup of quarks & leptons. How did a very small piece of the Universe start to think of itself as sentient? . Yes . Sorry, should hopefully be soon! . Not actually a payout, just a vesting of stock options. It may never pay out, as the stock can’t be sold for 5 years after exercise. The stock must be bought & income taxes paid, then hold value for 5 years. . No easy way to answer this in a tweet, but helping with dire brain injuries is our first priority. Details Aug 28. . AI symbiosis while u wait |
2020-07-11 | Haha . Maybe he should design flag of Mars . 48 65 78 20 74 6f 20 74 65 78 74 . pic.twitter.com/1MQXFAKPzf |
2020-07-12 | Thanks :) . That is the near-term danger of AI . I didn’t mind DA2. DA1 was awesome. Mass Effect 2 (talking about sequels) is amazing. |
2020-07-13 | Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride! |
2020-07-14 | Cute . Wow |
This R code creates a corpus object, which is a collection of documents (in this case, tweets about Tesla) that will be used for text analysis.
The Corpus() function is from the tm package, and is used to create a corpus object from a source text (in this case, VectorSource(tesla$tweet)). The VectorSource() function is used to create a source vector from the tweet column of the tesla data frame, which contains the text of the tweets.
The corpus[[1]][1] line prints the full text content of the first document in the corpus (not just its first word).
# Extract corpus
corpus = Corpus(VectorSource(tesla$tweet))
corpus[[1]][1]
This R code is using the tm_map() function from the tm package to transform the text in a corpus to lowercase.
# Convert to lower case
corpus = tm_map(corpus, FUN = content_transformer(tolower))
corpus[[1]][1]
Warning message in tm_map.SimpleCorpus(corpus, FUN = content_transformer(tolower)): “transformation drops documents”
The content_transformer() function is being used to apply a regular expression pattern to each document in the corpus. The regular expression pattern is 'http[[:alnum:][:punct:]]*', which matches any sequence of characters that starts with 'http', followed by any combination of alphanumeric and punctuation characters. This pattern is being replaced with a space character using the gsub() function, which is a built-in R function for replacing text in a string.
The overall effect of this code is to remove any URLs or web links from the text documents in the corpus, replacing them with a space character.
# Remove URLs
corpus = tm_map(corpus,
                FUN = content_transformer(FUN = function(x) gsub(pattern = 'http[[:alnum:][:punct:]]*',
                                                                 replacement = ' ', x = x)))
Warning message in tm_map.SimpleCorpus(corpus, FUN = content_transformer(FUN = function(x) gsub(pattern = "http[[:alnum:][:punct:]]*", : “transformation drops documents”
removePunctuation is a built-in function in the tm package that removes all punctuation marks from the text.
# Remove punctuation
corpus = tm_map(corpus,FUN = removePunctuation)
corpus[[1]][1]
Warning message in tm_map.SimpleCorpus(corpus, FUN = removePunctuation): “transformation drops documents”
The removeWords() function is applied to corpus using tm_map(). The removeWords() function is a text preprocessing function that takes a corpus as input and removes specified words from the text. In this case, stopwords('english') is used to generate a list of English stop words that are commonly used but do not carry much meaning on their own (such as "the", "and", "a", etc.). These stop words are then removed from the text in corpus.
The resulting corpus object will have the English stop words removed from its text.
corpus = tm_map(corpus,FUN = removeWords,c(stopwords('english')))
corpus[[1]][1]
Warning message in tm_map.SimpleCorpus(corpus, FUN = removeWords, c(stopwords("english"))): “transformation drops documents”
The stripWhitespace function is passed as an argument to tm_map(). This built-in tm function collapses runs of whitespace (spaces, tabs, line breaks) in each document down to a single space.
Applying it to corpus therefore normalizes the spacing left behind by the earlier removal steps.
corpus = tm_map(corpus,FUN = stripWhitespace)
corpus[[1]][1]
Warning message in tm_map.SimpleCorpus(corpus, FUN = stripWhitespace): “transformation drops documents”
Lemmatization is the process of reducing a word to its base or dictionary form, which can help to standardize text and improve the accuracy of text analysis.
The code first downloads a pre-trained language model for English using the udpipe_download_model function, and then loads the model using udpipe_load_model.
Next, the code defines a function called lemmatize_document which takes a single document as input and returns a vector of lemmas. The function tokenizes the document using the udpipe_annotate function, extracts the lemma column from the resulting data frame using as.data.frame, and removes numbers, punctuation, and other special characters from the lemmas using the gsub function.
Finally, the code applies the lemmatize_document function to each document in the corpus using the lapply function and stores the result in lemmatized_corpus. The resulting corpus is then printed using head.
# download the model to perform the lemmatization
model <- udpipe_download_model(language = "english")
# load the model
ud_model <- udpipe_load_model(file = model$file_model)
# create a function to lemmatize a single document
lemmatize_document <- function(document) {
  # tokenize and annotate the document with the udpipe model
  tokens <- udpipe_annotate(ud_model, x = document)
  # extract the lemma column from the annotation
  lemmas <- as.data.frame(tokens)$lemma
  # remove numbers, punctuation and other special characters
  lemmas <- gsub("[^[:alpha:][:space:]]*", "", lemmas)
  # return the lemmatized document
  return(lemmas)
}
# apply lemmatization to each document in the corpus
lemmatized_corpus <- lapply(corpus, lemmatize_document)
# print the lemmatized corpus
head(lemmatized_corpus, 5)
(Output truncated: udpipe downloads the english-ewt-ud-2.5 model, trained on version 2.5 of the Universal Dependencies data and distributed under the CC-BY-SA-NC license, to /kaggle/working/english-ewt-ud-2.5-191206.udpipe. Two read_connlu() warnings follow — "No parsed data in x$conllu, returning default empty data.frame" — which udpipe emits when a document is empty after cleaning.)
This code performs stemming on a previously lemmatized corpus of text using the SnowballC package in R. Stemming is the process of reducing a word to its root or base form, which can help to reduce the size of the vocabulary and improve text analysis.
The code first defines a function called stem_document which takes a single document as input and returns a vector of stems. The function uses the wordStem function from the SnowballC package to stem the words in the document, with the language parameter set to "english" for English language text.
Next, the code applies the stem_document function to each document in the previously lemmatized corpus using the lapply function and stores the result in stemmed_corpus.
# create a function to stem a single document
stem_document <- function(document) {
  # stem the words in the document
  stems <- wordStem(document, language = "english")
  # return the stemmed document
  return(stems)
}
# apply stemming to each document in the corpus
stemmed_corpus <- lapply(lemmatized_corpus, stem_document)
# print the stemmed corpus
head(stemmed_corpus, 5)
This code removes non-alpha characters from a previously stemmed corpus of text using the gsub function in R. The purpose of this step is to clean the text further and remove any remaining non-alphabetic characters that may interfere with subsequent text analysis.
The code first defines a function called clean_document which takes a single document as input and returns a cleaned version of the document. The function uses gsub to remove all characters that are not alphabets or white spaces from the document, with the regular expression "[^[:alpha:][:space:]]*" matching any character that is not a letter or whitespace.
Next, the code applies the clean_document function to each document in the previously stemmed corpus using the lapply function and stores the result in cleaned_corpus.
# Removing non-alpha characters
# create a function to remove non-alpha characters from a document
clean_document <- function(document) {
  # remove all characters that are not letters or whitespace
  clean_doc <- gsub("[^[:alpha:][:space:]]*", "", document)
  # return the cleaned document
  return(clean_doc)
}
# apply cleaning to each document in the lemmatized or stemmed corpus
cleaned_corpus <- lapply(stemmed_corpus, clean_document)
# print the cleaned corpus
head(cleaned_corpus, 5)
This code builds a document-term matrix (DTM) from the cleaned corpus with the intent of applying term frequency-inverse document frequency (TF-IDF) weighting, a widely used scheme in text mining that weights each term by its frequency within a document and its rarity across all documents.
The DTM is created with the DocumentTermMatrix function from the tm package. Note, however, that normalize = TRUE is not a weighting option recognized by DocumentTermMatrix(), so the control list has no effect here and the matrix actually contains raw term counts (the integer values in the output below confirm this); true TF-IDF weighting would be requested through the weighting control option instead.
Next, removeSparseTerms() with sparse = 0.95 drops the sparsest terms: only terms that appear in at least 5% of the documents are kept. The resulting DTM is converted to a matrix and then to a data frame using as.matrix and as.data.frame, respectively.
# TF-IDF
dtm_tfidf = DocumentTermMatrix(x = cleaned_corpus,
                               control = list(normalize = TRUE))
xdtm_tfidf = removeSparseTerms(dtm_tfidf, sparse = 0.95)
xdtm_tfidf = as.data.frame(as.matrix(xdtm_tfidf))
sort(colSums(xdtm_tfidf), decreasing = T)
head(xdtm_tfidf, 10)
"can", | "test", | "", | "almost", | "give", | "go", | "good", | "launch", | "m", | "much", | ⋯ | "improv", | "make", | "engin", | "sound", | "earth", | "order", | "exact", | "part", | "pretti", | "starship", | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 4 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 2 | 0 | 1 | 0 | 3 | 1 | 1 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
tesla_tfidf = cbind(tesla, xdtm_tfidf)
head(tesla_tfidf, 3)
date | tweet | "can", | "test", | "", | "almost", | "give", | "go", | "good", | "launch", | ⋯ | "improv", | "make", | "engin", | "sound", | "earth", | "order", | "exact", | "part", | "pretti", | "starship", | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | 2015-01-30 | If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2015-02-08 | Prob good though. Will give us time to replace 1st stage video transmitter (not needed for launch, but nice to have). . Air Force tracking radar went down. Launch postponed to same time tomorrow. . Rocket reentry will be much tougher this time around due to deep space mission. Almost 2X force and 4X heat. Plenty of hydraulic fluid tho. . Launching our 1st deep space mission today. Headed to Earth-Sun L1 gravity null point at 1M miles, 4X further than moon. | 0 | 0 | 4 | 1 | 1 | 1 | 1 | 3 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2015-02-10 | Launch postponed to tomorrow due to high winds at the Cape, but Dragon still inbound from orbit in 90 mins . Extreme wind shear over Cape Canaveral. Feels like a sledgehammer when supersonic in the vertical. Hoping it changes … . "What Are The Civilian Applications?" https://m.youtube.com/watch?v=M8YjvHYbZ9w … | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
This code performs sentiment analysis on a collection of tweets related to Tesla using the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool. The code loops through each tweet in the "tesla" data frame and applies the get_vader function to extract the sentiment scores.
Each score is then appended to its own list: "word_scores", "compound", "pos", "neu", "neg", and "but_count". The length() function is used to find the next free index in each list.
After processing all tweets, the code combines the lists into a data frame named vader_df using the data.frame() function. The unlist() function is used to flatten the lists into a single vector for each score type. The resulting data frame contains one row per tweet and columns for each score type.
# Create empty lists
word_scores_list = list()
compound_list = list()
pos_list = list()
neu_list = list()
neg_list = list()
but_count_list = list()
# Loop through tweets
for (row in tesla$tweet) {
  # Get sentiment scores
  sentiment = get_vader(row, incl_nt = T, neu_set = T, rm_qm = T)
  # Append scores to lists
  word_scores_list[[length(word_scores_list) + 1]] = sentiment["word_scores"]
  compound_list[[length(compound_list) + 1]] = sentiment["compound"]
  pos_list[[length(pos_list) + 1]] = sentiment["pos"]
  neu_list[[length(neu_list) + 1]] = sentiment["neu"]
  neg_list[[length(neg_list) + 1]] = sentiment["neg"]
  but_count_list[[length(but_count_list) + 1]] = sentiment["but_count"]
}
# Combine lists into data frame
vader_df = data.frame(
  word_scores = unlist(word_scores_list),
  compound = unlist(compound_list),
  pos = unlist(pos_list),
  neu = unlist(neu_list),
  neg = unlist(neg_list),
  but_count = unlist(but_count_list)
)
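As an aside, the vader package also exports vader_df(), which scores a whole character vector in one call and returns the same columns, avoiding the manual list bookkeeping above. A hedged sketch (check your installed version's documentation before relying on the exact signature; the notebook reuses the name vader_df for its own data frame, hence the explicit package prefix):

```r
library(vader)

# Small illustrative input; in the notebook this would be tesla$tweet
texts <- c("Launch was a great success!", "Rocket landing failed.")

# vader_df() scores each element and returns a data frame with columns
# text, word_scores, compound, pos, neu, neg, but_count
scores <- vader::vader_df(texts)
scores$compound
```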
colnames(vader_df)
tesla_sentiment = cbind(tesla_tfidf, vader_df)
print(nrow(tesla_sentiment))
head(tesla_sentiment, 5)
[1] 1360
date | tweet | "can", | "test", | "", | "almost", | "give", | "go", | "good", | "launch", | ⋯ | "exact", | "part", | "pretti", | "starship", | word_scores | compound | pos | neu | neg | but_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | 2015-01-30 | If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} | 0.318 | 0.141 | 0.859 | 0 | 0 |
2 | 2015-02-08 | Prob good though. Will give us time to replace 1st stage video transmitter (not needed for launch, but nice to have). . Air Force tracking radar went down. Launch postponed to same time tomorrow. . Rocket reentry will be much tougher this time around due to deep space mission. Almost 2X force and 4X heat. Plenty of hydraulic fluid tho. . Launching our 1st deep space mission today. Headed to Earth-Sun L1 gravity null point at 1M miles, 4X further than moon. | 0 | 0 | 4 | 1 | 1 | 1 | 1 | 3 | ⋯ | 0 | 0 | 0 | 0 | {0, 0.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} | 0.67 | 0.088 | 0.887 | 0.025 | 1 |
3 | 2015-02-10 | Launch postponed to tomorrow due to high winds at the Cape, but Dragon still inbound from orbit in 90 mins . Extreme wind shear over Cape Canaveral. Feels like a sledgehammer when supersonic in the vertical. Hoping it changes … . "What Are The Civilian Applications?" https://m.youtube.com/watch?v=M8YjvHYbZ9w … | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | {0, -0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.25, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} | 0.785 | 0.139 | 0.835 | 0.027 | 1 |
4 | 2015-02-11 | Rocket soft landed in the ocean within 10m of target & nicely vertical! High probability of good droneship landing in non-stormy weather. . Primary mission on target. Spacecraft head towards the sun! All good there. . @DanielLockyer We could actually do that...maybe we should . Planning a significant upgrade of the droneship for future missions to handle literally anything. Maybe give it a Merlin for good measure :) . Can't delay any longer. Must proceed with primary mission to launch the Deep Space Climate Observatory spacecraft. . Mega storm preventing droneship from remaining on station, so rocket will try to land on water. Survival probability <1%. . Coming home pic.twitter.com/FmrmYs6R6V . Dragon splashdown off the California coast pic.twitter.com/4Bvfmei8I3 | 0 | 0 | 2 | 0 | 1 | 0 | 3 | 1 | ⋯ | 0 | 0 | 0 | 0 | {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.9, 0, 2, 0, 0, 0.962, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} | 0.95 | 0.146 | 0.846 | 0.008 | 0 |
5 | 2015-02-12 | Landing on a stormy sea pic.twitter.com/7EY25g3IU5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | {0, 0, 0, 0, 0, 0} | 0 | 0 | 1 | 0 | 0 |
write.csv(tesla_sentiment, 'pre_finance.csv')
This code is for downloading stock price data for Tesla from Yahoo Finance within a specified date range.
First, the stock symbol "TSLA" is defined and the start and end dates are specified using as.Date().
getSymbols() is then used to download the historical stock price data for TSLA from Yahoo Finance, with the data source specified as "yahoo" and the date range specified using the from and to arguments.
Next, Ad() is used to extract the adjusted close price data from the downloaded data. as.data.frame() is used to convert the data into a dataframe format, and rownames_to_column() is used to move the date information from the row names to a separate "date" column in the dataframe. The resulting dataframe is assigned to tsla_prices.
# define the stock symbol and data range
symbol <- "TSLA"
start_date <- as.Date("2015-01-29")
end_date <- as.Date("2020-07-15")
# get the stock price data from Yahoo Finance
getSymbols(symbol, src = "yahoo", from = start_date, to = end_date)
# extract the adjusted close price data
tsla_prices <- Ad(get(symbol))
tsla_prices <- as.data.frame(tsla_prices) %>%
rownames_to_column(var = "date")
# view the last few prices
tail(tsla_prices)
date | TSLA.Adjusted | |
---|---|---|
<chr> | <dbl> | |
1369 | 2020-07-07 | 92.65733 |
1370 | 2020-07-08 | 91.05867 |
1371 | 2020-07-09 | 92.95200 |
1372 | 2020-07-10 | 102.97667 |
1373 | 2020-07-13 | 99.80400 |
1374 | 2020-07-14 | 101.12000 |
This code converts the "date" column in the tesla_sentiment and tsla_prices dataframes from character or factor type to Date type using as.Date().
This is useful when combining or merging dataframes on the basis of the "date" column, as it ensures that the dates are in the same format and can be compared properly.
tesla_sentiment$date = as.Date(tesla_sentiment$date)
tsla_prices$date = as.Date(tsla_prices$date)
tail(tesla_sentiment, 3)
date | tweet | "can", | "test", | "", | "almost", | "give", | "go", | "good", | "launch", | ⋯ | "exact", | "part", | "pretti", | "starship", | word_scores | compound | pos | neu | neg | but_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<date> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1358 | 2020-07-12 | Thanks :) . That is the near-term danger of AI . I didn’t mind DA2. DA1 was awesome. Mass Effect 2 (talking about sequels) is amazing. | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | {1.9, 2, 0, 0, 0, 0, 0, -2.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3.1, 0, 0, 0, 0, 0, 0, 0, 2.8} | 0.886 | 0.361 | 0.55 | 0.089 | 0 |
1359 | 2020-07-13 | Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride! | 0 | 1 | 7 | 0 | 0 | 0 | 1 | 1 | ⋯ | 0 | 1 | 0 | 0 | {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.5, 0, 0, 0, 1.4, 0, 0, 0, 0, 1, 0, 0, 0.55, 0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, -0.3, 0, 3, 0, 2.55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.45, 0, 0, 0, 0, 0, 0, 0, 3.2895, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 4.05, 0, 2.55, 0, 0, 0, 0, 0, 0, 0.15, 0, 0, -1.65, 0, 0, 0, 0, -1.8, 0, 0, 0, 0, 0, -1.65, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -5.4, 0, 0, 0} | 0.973 | 0.18 | 0.742 | 0.079 | 1 |
1360 | 2020-07-14 | Cute . Wow | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | {2, 0, 2.8} | 0.778 | 0.872 | 0.128 | 0 | 0 |
colnames(tesla_sentiment)
This code performs the following tasks:
It merges the tesla_sentiment and tsla_prices data frames on the "date" column using the merge() function. The argument all.y = TRUE keeps every row of tsla_prices in the merged data frame.
It adds a new column called "increase" to the merged data frame, containing binary values that indicate whether the adjusted stock price rose relative to the previous trading day (1) or not (0). This is our target variable.
It replaces the NA values in columns 3 through ncol(merged_df) - 1 (every column except "date", "tweet", and "increase") with 0, using is.na() inside a data-frame subassignment. Note that `:` binds more tightly than `-` in R, so the range must be written 3:(ncol(merged_df) - 1).
merged_df = merge(tesla_sentiment, tsla_prices, by="date", all.y = TRUE)
# Add increase column
merged_df$increase <- c(0, ifelse(diff(merged_df$TSLA.Adjusted) > 0, 1, 0))
# Replace NA values with 0; note the parentheses, since ':' binds tighter than '-'
merged_df[, 3:(ncol(merged_df) - 1)][is.na(merged_df[, 3:(ncol(merged_df) - 1)])] = 0
tail(merged_df, 2)
date | tweet | "can", | "test", | "", | "almost", | "give", | "go", | "good", | "launch", | ⋯ | "pretti", | "starship", | word_scores | compound | pos | neu | neg | but_count | TSLA.Adjusted | increase | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<date> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
1373 | 2020-07-13 | Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride! | 0 | 1 | 7 | 0 | 0 | 0 | 1 | 1 | ⋯ | 0 | 0 | {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.5, 0, 0, 0, 1.4, 0, 0, 0, 0, 1, 0, 0, 0.55, 0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, -0.3, 0, 3, 0, 2.55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.45, 0, 0, 0, 0, 0, 0, 0, 3.2895, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 4.05, 0, 2.55, 0, 0, 0, 0, 0, 0, 0.15, 0, 0, -1.65, 0, 0, 0, 0, -1.8, 0, 0, 0, 0, 0, -1.65, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -5.4, 0, 0, 0} | 0.973 | 0.18 | 0.742 | 0.079 | 1 | 99.804 | 0 |
1374 | 2020-07-14 | Cute . Wow | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | {2, 0, 2.8} | 0.778 | 0.872 | 0.128 | 0 | 0 | 101.120 | 1 |
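One indexing subtlety worth a quick check: in R, `:` binds more tightly than `-`, so an unparenthesized `3:ncol(merged_df)-1` would select columns 2 through ncol-1 rather than 3 through ncol-1. A minimal sketch:

```r
n <- 6
3:n - 1    # evaluates as (3:n) - 1, i.e. 2 3 4 5
3:(n - 1)  # the usually intended range:  3 4 5
```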
write.csv(merged_df, file="tesla_sentiment.csv", row.names = F)
tesla_sentiment = read.csv('/kaggle/input/pre-data-visualization/tesla_sentiment.csv')
head(tesla_sentiment, 2)
date | tweet | X.can.. | X.test.. | X... | X.almost.. | X.give.. | X.go.. | X.good.. | X.launch.. | ⋯ | X.pretti.. | X.starship.. | word_scores | compound | pos | neu | neg | but_count | TSLA.Adjusted | increase | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | <int> | |
1 | 2015-01-29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 13.68000 | 0 |
2 | 2015-01-30 | If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} | 0.318 | 0.141 | 0.859 | 0 | 0 | 13.57333 | 0 |
This code block cleans the column names of the tesla_sentiment data frame. Here are the details for each line:
The first line uses a regular expression to strip the 'X.' prefix and '..' suffix that read.csv() added to the word-count column names, storing the cleaned names in a variable called new_col_names. The second line assigns the new names back to tesla_sentiment using the colnames function. The third line drops the leftover columns "", "m", "x", ".1", and "s" using the negation operator ! with the %in% function, and assigns the resulting data frame to a new variable called final_data_new. The head function is then used to display the first five rows of the cleaned data frame.
# Remove 'X.' and '..' from column names
new_col_names = gsub('X\\.|\\.\\.', '', colnames(tesla_sentiment))
# Assign new column names to the data frame
colnames(tesla_sentiment) <- new_col_names
# Drop columns "", "m", "x", ".1", and "s"
final_data_new <- tesla_sentiment[, !names(tesla_sentiment) %in% c('', 'm', 'x', '.1', 's')]
head(final_data_new, 5)
date | tweet | can | test | almost | give | go | good | launch | much | ⋯ | pretti | starship | word_scores | compound | pos | neu | neg | but_count | TSLA.Adjusted | increase | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | <int> | |
1 | 2015-01-29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 13.68000 | 0 |
2 | 2015-01-30 | If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} | 0.318 | 0.141 | 0.859 | 0 | 0 | 13.57333 | 0 |
3 | 2015-02-02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 14.06267 | 1 |
4 | 2015-02-03 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 14.55733 | 1 |
5 | 2015-02-04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 14.57000 | 1 |
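The mangled names come from read.csv(), which by default runs make.names() on the header; the gsub() call simply strips the characters that substitution introduced. A small sketch with a few assumed header values:

```r
# a few header values as they appear after read.csv() (assumed examples)
raw_names <- c("X.can..", "X.test..", "date", "word_scores")
# remove the 'X.' prefix and '..' suffix in one pass
gsub('X\\.|\\.\\.', '', raw_names)
# "can" "test" "date" "word_scores"
```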
This code creates a bar chart of the "increase" column in the final_data_new data frame. The "increase" column contains binary values (0 or 1) that represent whether the stock price increased on a particular day.
The code uses the ggplot2 library. The "data" parameter specifies the data frame to use, and the "aes" function maps the "increase" column to the x-axis.
The "geom_bar" function draws one bar per class, with the "fill" parameter setting the fill color of the bars.
The "theme_bw" function sets the theme of the plot to a black and white style.
The "xlab" function sets the label for the x-axis.
Finally, the "coord_flip" function swaps the x and y axes of the plot. This makes the plot horizontal, with the bars running from left to right rather than from bottom to top.
#visualizing increase vs decrease
ggplot(data=final_data_new,aes(x=increase))+
geom_bar(fill=c('orange', 'brown'))+
theme_bw()+
xlab('Stock Decrease (0) or Increase (1)') +
coord_flip()
final_data_new_visual <- final_data_new[, !names(final_data_new) %in% c('word_scores', 'tweet','date','compound', 'pos', 'neu', 'neg', 'but_count', 'TSLA.Adjusted', 'increase')]
head(final_data_new_visual, 3)
can | test | almost | give | go | good | launch | much | need | point | ⋯ | improv | make | engin | sound | earth | order | exact | part | pretti | starship | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
This code visualizes the top 25 words appearing in Elon Musk's tweets.
col_sums calculates the sum of each column in final_data_new_visual, which contains the frequency of each word used in the tweets.
final_data_new_visual_sums creates a new data frame containing two columns: word and value, where word is the name of the word and value is the sum of its frequency in the tweets.
final_data_new_visual_sums_top25 creates a subset of final_data_new_visual_sums containing only the top 25 words sorted by their frequency in descending order.
Finally, ggplot is used to create a bar plot with the words on one axis and their summed frequencies on the other. The reorder function orders the words by value (ascending), so that after coord_flip the most frequent word sits at the top. The theme function rotates the axis labels by 90 degrees, coord_flip switches the axes of the plot, and the bars are filled with the color "sienna3".
#visualizing top 25 words appearing in Elon's Tweet
col_sums <- colSums(final_data_new_visual)
final_data_new_visual_sums <- data.frame(
word = names(col_sums),
value = col_sums
)
final_data_new_visual_sums_top25 <- final_data_new_visual_sums[order(final_data_new_visual_sums$value, decreasing = TRUE),][1:25,]
# visualize the results with a bar plot
ggplot(data = final_data_new_visual_sums_top25, aes(x = reorder(word, value), y = value)) +
geom_bar(stat = "identity", fill = "sienna3") +
labs(x = "Word", y = "Sum of Values") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()
This code performs some date manipulations and summarizes the data on a yearly basis.
The lubridate library is loaded, which provides some useful functions for working with dates. Then, the as.Date() function is used to convert the date column of the final_data_new data frame to a date format. The quarter() and year() functions from lubridate are then used to create new columns quarter and year, which extract the quarter and year information from the date column.
Next, the %>% operator from the dplyr package is used to pipe the final_data_new data frame to a series of operations. The mutate() function creates a new column year, which extracts the year information from the date column. The data is then grouped by year using the group_by() function, and the summarise() function is used to calculate the mean values of the pos, neu, compound, neg, and TSLA.Adjusted columns for each year. The resulting data frame df_yearly contains the summarized data.
library(lubridate)
final_data_new$date <- as.Date(final_data_new$date, format = "%Y-%m-%d")
final_data_new$quarter <- quarter(final_data_new$date)
final_data_new$year <- year(final_data_new$date)
df_yearly <- final_data_new %>%
mutate(year = year(date)) %>%
group_by(year) %>%
summarise(pos = mean(pos), neu = mean(neu), compound = mean(compound), neg = mean(neg), TSLA.Adjusted = mean(TSLA.Adjusted))
head(final_data_new, 5)
date | tweet | can | test | almost | give | go | good | launch | much | ⋯ | starship | word_scores | compound | pos | neu | neg | but_count | TSLA.Adjusted | increase | quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<date> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | <int> | <int> | |
1 | 2015-01-29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 13.68000 | 0 | 1 |
2 | 2015-01-30 | If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} | 0.318 | 0.141 | 0.859 | 0 | 0 | 13.57333 | 0 | 1 |
3 | 2015-02-02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 14.06267 | 1 | 1 |
4 | 2015-02-03 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 14.55733 | 1 | 1 |
5 | 2015-02-04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 14.57000 | 1 | 1 |
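The lubridate accessors used above work on a single date just as they do on a whole column; for example:

```r
library(lubridate)

d <- as.Date("2020-07-13")
year(d)     # 2020
quarter(d)  # 3  (July falls in the third quarter)
```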
This code creates two plots side by side, showing the sentiment scores over time and the TSLA adjusted stock prices over time.
The first plot is created using ggplot and shows four lines, each representing a different sentiment score (positive, neutral, compound, and negative) over the years in the dataset. The x-axis represents the year, and the y-axis represents the sentiment score. The color of each line is manually set using scale_color_manual, and the plot is given a title and a y-axis label.
The second plot is also created using ggplot and shows a single line representing the TSLA adjusted stock price over the years in the dataset. The x-axis and y-axis represent the year and the adjusted stock price, respectively. The plot is given a title and a y-axis label.
The two plots are combined side by side using the grid.arrange() function from the gridExtra package. The ncol argument specifies the number of columns in the resulting grid, and the widths argument specifies the relative widths of the columns.
#Sentiment Score over time vs Stock Price
# Create the first plot for sentiment scores
plot1 <- ggplot(data = df_yearly, aes(x = year)) +
geom_line(aes(y = pos, color = "Positive")) +
geom_line(aes(y = neu, color = "Neutral")) +
geom_line(aes(y = compound, color = "Compound")) +
geom_line(aes(y = neg, color = "Negative")) +
ylab("Sentiment Score") +
ggtitle("Sentiment Scores Over Time") +
scale_color_manual(values = c("Positive" = "blue", "Neutral" = "grey",
"Compound" = "black", "Negative" = "red"))
# Create the second plot for TSLA adjusted stock prices
plot2 <- ggplot(data = df_yearly, aes(x = year)) +
geom_line(aes(y = TSLA.Adjusted)) +
ylab("Adjusted Stock Price") +
ggtitle("TSLA Adjusted Stock Prices Over Time")
# Combine the two plots side by side using grid.arrange() from the gridExtra package
library(gridExtra)
grid.arrange(plot1, plot2, ncol = 2,widths=c(2/3, 1/3))
This code visualizes the top positive and negative words in the final_data_new_visual_sums data frame by joining it with the Bing sentiment lexicon using the inner_join() function from the dplyr package. The resulting data set is grouped by sentiment.
After that, the code splits the data into positive and negative subsets, orders each in descending order by value, selects the top 25 rows of each, and creates bar plots using the ggplot2 package. Finally, the two plots are combined side by side using the grid.arrange() function from the gridExtra package.
#Visualizing top positive words and top negative words
library(tidytext)
bing <- read.csv("/kaggle/input/bing-lexicon/bing-2.csv")
bing_sentiment <- final_data_new_visual_sums%>%
inner_join(bing)%>%
group_by(sentiment)
# filter to only include positive and negative sentiment
bing_sentiment_pos <- bing_sentiment[bing_sentiment$sentiment == "positive", ]
bing_sentiment_neg <- bing_sentiment[bing_sentiment$sentiment == "negative", ]
# arrange in descending order by value
bing_sentiment_pos <- bing_sentiment_pos[order(-bing_sentiment_pos$value), ]
bing_sentiment_neg <- bing_sentiment_neg[order(-bing_sentiment_neg$value), ]
# select top 25 rows
bing_sentiment_pos_top25 <- head(bing_sentiment_pos, 25)
bing_sentiment_neg_top25 <- head(bing_sentiment_neg, 25)
# Define color palette
my_colors <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999")
# Create bar plot for positive words with custom colors
plot_pos <- ggplot(bing_sentiment_pos_top25, aes(x = reorder(word, value), y = value, fill = word)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = my_colors[1:length(unique(bing_sentiment_pos_top25$word))]) +
labs(title = "Top 25 Positive Words", x = "Word", y = "Value") +
coord_flip()
# Create bar plot for negative words with custom colors
plot_neg <- ggplot(bing_sentiment_neg_top25, aes(x = reorder(word, value), y = value, fill = word)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = my_colors[1:length(unique(bing_sentiment_neg_top25$word))]) +
labs(title = "Top 25 Negative Words", x = "Word", y = "Value")
# Combine the two plots side by side using grid.arrange() from the gridExtra package
library(gridExtra)
grid.arrange(plot_pos, plot_neg, ncol = 2, widths=c(1/2, 1/2))
Joining with `by = join_by(word)`
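The effect of the inner join can be seen on a toy example (the lexicon rows below are hypothetical stand-ins for the Bing file):

```r
library(dplyr)

counts  <- data.frame(word = c("good", "bad", "rocket"), value = c(5, 2, 7))
lexicon <- data.frame(word = c("good", "bad"), sentiment = c("positive", "negative"))

inner_join(counts, lexicon, by = "word")
# 'rocket' is dropped: inner_join keeps only words present in both tables
```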
This code creates a new column in the final_data_new data frame called emojis. The str_extract_all() function from the stringr package extracts every emoji matching the pattern from the tweet column, returning a list-column with one character vector per row.
final_data_new$emojis <- str_extract_all(final_data_new$tweet, "[\U0001F300-\U0001F64F]|\U0001F680|\U0001F6A5|\U0001F6B2|\U0001F30D|\U0001F30E|\U0001F31E|\U0001F363|\U0001F377|\U0001F37B|\U0001F41F|\U0001F42C|\U0001F355|\U0001F419|\U0001F680|[\u2600-\u26FF]")
tail(final_data_new, 2)
date | tweet | can | test | almost | give | go | good | launch | much | ⋯ | word_scores | compound | pos | neu | neg | but_count | TSLA.Adjusted | increase | quarter | emojis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<date> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | <int> | <int> | <list> | |
1373 | 2020-07-13 | Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride! | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | ⋯ | {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.5, 0, 0, 0, 1.4, 0, 0, 0, 0, 1, 0, 0, 0.55, 0, 0, 1.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, -0.3, 0, 3, 0, 2.55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.45, 0, 0, 0, 0, 0, 0, 0, 3.2895, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 4.05, 0, 2.55, 0, 0, 0, 0, 0, 0, 0.15, 0, 0, -1.65, 0, 0, 0, 0, -1.8, 0, 0, 0, 0, 0, -1.65, 0, 0, 0, 0, 0, 0, 0, 0, 1.95, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -5.4, 0, 0, 0} | 0.973 | 0.180 | 0.742 | 0.079 | 1 | 99.804 | 0 | 3 | 👀 |
1374 | 2020-07-14 | Cute . Wow | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | {2, 0, 2.8} | 0.778 | 0.872 | 0.128 | 0.000 | 0 | 101.120 | 1 | 3 |
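str_extract_all() returns a list with one character vector per input string, which is why the emojis column is a list-column. For instance, using just the first character class of the pattern:

```r
library(stringr)

str_extract_all("Wild times! 👀 🔥🔥", "[\U0001F300-\U0001F64F]")
# [[1]] "👀" "🔥" "🔥"
```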
This code adds a new column called n_emojis to the final_data_new data frame. It does so by using the sapply() function to apply a function called len_emojis() to each element of the emojis list-column.
The len_emojis() function takes one element of that list (the vector of emojis extracted from a day's tweets) and returns its length, i.e. the number of emojis.
Before counting, the code replaces any missing values in the emojis column with 0 using the replace_na() function from the tidyr package, so that every element has a well-defined length.
final_data_new$emojis = replace_na(final_data_new$emojis, 0)
len_emojis = function(col){
n_emojis = length(col)
return(n_emojis)
}
final_data_new$n_emojis = sapply(final_data_new$emojis, FUN=len_emojis)
tail(final_data_new)
date | tweet | can | test | almost | give | go | good | launch | much | ⋯ | compound | pos | neu | neg | but_count | TSLA.Adjusted | increase | quarter | emojis | n_emojis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<date> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | <int> | <int> | <list> | <int> | |
1369 | 2020-07-07 | If you get past Mars, the asteroids, moons of Jupiter & Saturn, inevitably you reach Uranus! . 🖤✨Carl Sagan ✨🖤 . Essentially. Long-term purpose of my Tesla stock is to help make life multiplanetary to ensure it’s continuance. The massive capital needs are in 10 to 20 years. By then, if we’re fortunate, Tesla’s goal of accelerating sustainable energy & autonomy will be mostly accomplished. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0.912 | 0.198 | 0.802 | 0.000 | 0 | 92.65733 | 1 | 3 | 🖤, 🖤 | 2 |
1370 | 2020-07-08 | Tesla China team is awesome! . Words are a very lossy compression of thought | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0.440 | 0.233 | 0.635 | 0.132 | 0 | 91.05867 | 0 | 3 | 0 | |
1371 | 2020-07-09 | There’s some of that too . True, it sounds so surreal, but the negative propaganda is still all out there & easy to find in social media & press interviews, so it’s not just our imagination! . Make sure to read ur terms & conditions before clicking accept! . Samwise Gamgee . Altho Dumb and Dumber is 🔥🔥 . Progress update August 28 . Sure . If you can’t beat em, join em Neuralink mission statement | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0.708 | 0.216 | 0.635 | 0.149 | 1 | 92.95200 | 1 | 3 | 🔥, 🔥 | 2 |
1372 | 2020-07-10 | I spoke with Korolev’s family today. He was one of the very best. Корольов / Королёв. . In general, we need to improve how podcasts play . SPQR https://m.youtube.com/watch?v=wjOfQfxmTLQ … . True haha . Death is the loss of information . Earning power post augmentation would easily pay for itself (if that’s even necessary). This is the best thing I can think of to ensure that collective human will decides the future. . Absolutely . 👀 . Probably a good one to design & engineer in Germany . Even more . Ok . 👀 . Berlin Model Y is the one to watch. That is a revolution in automotive body engineering (finally). . True . Wow, IHOP & GitHub are close . Best use of the term “Full Stack”? . For sure. This is both great & terrifying. Everything we’ve ever sensed or thought has been electrical signals. The early universe was just a soup of quarks & leptons. How did a very small piece of the Universe start to think of itself as sentient? . Yes . Sorry, should hopefully be soon! . Not actually a payout, just a vesting of stock options. It may never pay out, as the stock can’t be sold for 5 years after exercise. The stock must be bought & income taxes paid, then hold value for 5 years. . No easy way to answer this in a tweet, but helping with dire brain injuries is our first priority. Details Aug 28. . AI symbiosis while u wait | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ⋯ | 0.964 | 0.147 | 0.802 | 0.051 | 1 | 102.97667 | 1 | 3 | 👀, 👀 | 2 |
1373 | 2020-07-13 | Reusability is essential. A rocket that is single use is just as absurd as a single use airplane. F9 engines already fire 3 times per flight. . Wild times! . We’re being extra paranoid. Maximizing probability of successful launch is paramount. . Welcome anytime . Well, I do care very much about sustainability of civilization, but there is some truth to the irony part haha . Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year. . 👀 . I think so . Doing range testing now. Number will be significantly higher than 300. Extremely good for any EV, especially an SUV. . We have reduced pricing on Model Y LR dual motor & will offer a LR single motor Y in a few months, which improves affordability, while still keeping the product excellent . Yes . It may be able to reach 250kW at low states of charge . No, as range would be unacceptably low (< 250 mile EPA) . We had to increase some wire thicknesses in S/X to reduce resistive heating. Technically, won’t be quite 1000 mph charging, as X especially is much bigger than 3. . Hell of a ride! | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | ⋯ | 0.973 | 0.180 | 0.742 | 0.079 | 1 | 99.80400 | 0 | 3 | 👀 | 1 |
1374 | 2020-07-14 | Cute . Wow | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0.778 | 0.872 | 0.128 | 0.000 | 0 | 101.12000 | 1 | 3 | 0 |
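Counting emojis per day then reduces to taking the length of each list element, e.g.:

```r
# toy list-column: two emojis, none, one
emoji_lists <- list(c("👀", "🔥"), character(0), "🚀")
sapply(emoji_lists, length)
# 2 0 1
```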
# Correlation between number of emojis used and stock price
cor(final_data_new$TSLA.Adjusted, final_data_new$n_emojis)
This code creates a table of counts for each emoji present in the final_data_new data set using the mtabulate function from the qdapTools package.
The resulting table is assigned to emoji_table, and the column names are modified to add an "emoji_" prefix to each column name using the paste0 function.
Finally, d is created by binding emoji_table and the TSLA.Adjusted column of final_data_new together using the cbind function. This new data set d pairs each day's emoji counts with the corresponding TSLA stock price.
emoji_table <- mtabulate(final_data_new$emojis)
colnames(emoji_table) <- paste0("emoji_", colnames(emoji_table))
d = cbind(emoji_table, final_data_new$TSLA.Adjusted)
tail(d, 20)
emoji_🌃 | emoji_🌈 | emoji_🌌 | emoji_🌏 | emoji_🌪 | emoji_🌸 | emoji_🍀 | emoji_🍁 | emoji_🍂 | emoji_🍃 | ⋯ | emoji_😘 | emoji_😛 | emoji_😜 | emoji_😢 | emoji_😮 | emoji_😲 | emoji_😴 | emoji_🙏 | emoji_🚀 | final_data_new$TSLA.Adjusted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <dbl> | |
1355 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 65.47533 |
1356 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 66.11933 |
1357 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 66.93066 |
1358 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 66.72667 |
1359 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 66.28800 |
1360 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 66.78533 |
1361 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 64.05666 |
1362 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 65.73200 |
1363 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 63.98267 |
1364 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 67.29000 |
1365 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 71.98734 |
1366 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 74.64200 |
1367 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 80.57733 |
1368 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 91.43867 |
1369 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 92.65733 |
1370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 91.05867 |
1371 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 92.95200 |
1372 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 102.97667 |
1373 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 99.80400 |
1374 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 101.12000 |
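mtabulate() turns a list of vectors into one row of counts per list element, which is what produces the per-day emoji matrix above. A minimal sketch:

```r
library(qdapTools)

# two "days": one with the fire emoji twice, one with the eyes emoji once
mtabulate(list(c("🔥", "🔥"), "👀"))
# row 1 counts 🔥 = 2 and 👀 = 0; row 2 counts 🔥 = 0 and 👀 = 1
```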
# select first and last column
df_new <- d %>% select(col1=colnames(d)[1], price='final_data_new$TSLA.Adjusted') %>%
filter(col1 != 0) %>%
summarise(avg_price = mean(price))
df_new[1,1]
This code initializes an empty list called price_by_emoji and a vector col_names holding the column names of the emoji_table data frame. It then loops over each column name: for every emoji column it builds a data frame df_new by selecting that column (renamed to col1) together with the TSLA.Adjusted column from d, filters out rows where the emoji count is 0, and calculates the mean of the remaining TSLA.Adjusted values. The result is stored in the price_by_emoji list under the column's name. In other words, this code computes the average stock price on the days each emoji was used.
price_by_emoji = list()
col_names = colnames(emoji_table)
for (col_name in col_names) {
df_new <- d %>% select(col1=col_name, price='final_data_new$TSLA.Adjusted') %>%
filter(col1 != 0) %>%
summarise(avg_price = mean(price))
price_by_emoji[[col_name]] = df_new[1,1]
}
Warning message:
“Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(col_name)
# Now:
data %>% select(all_of(col_name))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.”
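As the tidyselect warning suggests, wrapping the character variable in all_of() avoids the deprecated external-vector selection. A toy version of one loop iteration (d_toy stands in for d):

```r
library(dplyr)

d_toy <- data.frame(a = c(0, 1, 2), price = c(10, 20, 30))
col_name <- "a"

d_toy %>%
  select(col1 = all_of(col_name), price) %>%  # all_of() silences the warning
  filter(col1 != 0) %>%
  summarise(avg_price = mean(price))
# avg_price = 25
```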
df <- data.frame(emoji = names(price_by_emoji), avg_stock_price = unlist(price_by_emoji))
df %>% arrange(desc(avg_stock_price)) %>% select(-1)
avg_stock_price | |
---|---|
<dbl> | |
emoji_👀 | 84.11845 |
emoji_👸 | 52.17200 |
emoji_🔭 | 47.96400 |
emoji_👆 | 46.75467 |
emoji_💯 | 42.75367 |
emoji_🕺 | 42.72067 |
emoji_🏻 | 41.62100 |
emoji_😅 | 40.66311 |
emoji_😮 | 38.20000 |
emoji_💕 | 37.68836 |
emoji_💘 | 36.74433 |
emoji_🍻 | 36.48000 |
emoji_🔥 | 34.06458 |
emoji_🐈 | 33.29817 |
emoji_😢 | 33.23467 |
emoji_👶 | 32.08933 |
emoji_🎶 | 31.43717 |
emoji_😎 | 29.15167 |
emoji_🎄 | 28.72933 |
emoji_👻 | 28.43200 |
emoji_👍 | 28.11459 |
emoji_🎥 | 27.88867 |
emoji_🖤 | 24.82805 |
emoji_🌏 | 24.75355 |
emoji_🎁 | 24.66133 |
emoji_🍜 | 24.38067 |
emoji_😀 | 23.72745 |
emoji_😉 | 23.62442 |
emoji_🌃 | 23.50333 |
emoji_🍒 | 23.32867 |
⋮ | ⋮ |
emoji_🐿 | 19.45400 |
emoji_🎼 | 19.42800 |
emoji_🐉 | 19.25767 |
emoji_👌 | 19.23480 |
emoji_🍃 | 18.89433 |
emoji_🙏 | 18.84578 |
emoji_🚀 | 18.60696 |
emoji_😜 | 18.44567 |
emoji_👽 | 18.43967 |
emoji_🐌 | 18.16400 |
emoji_😲 | 17.86378 |
emoji_🐶 | 17.54933 |
emoji_😇 | 17.52533 |
emoji_😔 | 17.44333 |
emoji_🍁 | 17.24400 |
emoji_🍂 | 17.24400 |
emoji_🐏 | 17.04533 |
emoji_🐐 | 17.00200 |
emoji_🍷 | 16.90267 |
emoji_👁 | 16.44000 |
emoji_🎤 | 16.30200 |
emoji_🕳 | 16.30200 |
emoji_💣 | 16.23267 |
emoji_💦 | 16.23267 |
emoji_🏴 | 15.46333 |
emoji_😋 | 15.18667 |
emoji_🍀 | 15.16900 |
emoji_💰 | 14.79067 |
emoji_😐 | 14.06867 |
emoji_🐣 | 12.34400 |
This code creates a bar chart of the top 40 most frequent two-word combinations (2-grams) in the corpus of tweet text.
First, the tokens() function from the quanteda package builds a Tokens object from the stemmed corpus (stemmed_corpus, created earlier in the notebook). The tokens_ngrams() function then produces an object containing all 2-grams from that Tokens object.
Next, the dfm() function from the quanteda package creates a document-feature matrix (DFM) from the 2-gram tokens. The textstat_frequency() function from quanteda.textstats then returns a data frame of the 2-gram features and their frequencies.
Finally, the top 40 most frequent 2-grams are plotted using the ggplot2 package. The reorder() function orders the 2-grams by frequency, and the geom_col() function draws the bars.
# 2-word frequency (2-gram)
# create tokens object from stemmed corpus
tokens_stemmed <- tokens(stemmed_corpus)
dfm2 <- dfm(tokens_ngrams(tokens_stemmed, n = 2))
dfFreq2 <- textstat_frequency(dfm2)
library(ggplot2)
ggplot(dfFreq2[1:40, ], aes(x = reorder(feature, frequency), y = frequency)) +
geom_col() +
coord_flip() +
scale_x_discrete(name = "2-gram") +
theme(text = element_text(size = 12))
# 2-gram features to data frame
dtm2 <- convert(dfm2, to = "tm")
data_dtm2 <- as.data.frame(as.matrix(dtm2))
head(data_dtm2)
curious_p | p_would | would_can | can_schedul | schedul_test | test_drive | prob_good | good_though | though_give | give_we | ⋯ | heat_technic | technic_will | quit_mph | mph_charg | charg_x | x_especi | especi_much | bigger_hell | hell_ride | cute_wow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
text1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
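As a base-R illustration of what tokens_ngrams(n = 2) produces for a single document, adjacent tokens are simply pasted together with an underscore:

```r
# toy token vector (stemmed words from the first tweet)
toks <- c("can", "schedul", "test", "drive")

# pair each token with its successor, joined by "_"
bigrams <- paste(head(toks, -1), tail(toks, -1), sep = "_")
bigrams  # "can_schedul" "schedul_test" "test_drive"
```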
This code builds a frequency distribution of 3-grams in the text data.
The tokens_ngrams() function from the quanteda package creates 3-grams from the stemmed-corpus tokens.
The dfm() function then builds a document-feature matrix from those tokens.
The textstat_frequency() function produces a frequency table of the 3-grams, stored in dfFreq3.
Finally, ggplot2 draws a bar plot of the top 40 most frequent 3-grams.
# 3-word frequency (3-gram)
# create 3-gram dfm from the stemmed tokens
dfm3 <- dfm(tokens_ngrams(tokens_stemmed, n = 3))
dfFreq3 <- textstat_frequency(dfm3)
ggplot(dfFreq3[1:40, ], aes(x = reorder(feature, frequency), y = frequency)) +
geom_col() +
coord_flip() +
scale_x_discrete(name = "3-gram") +
theme(text = element_text(size = 12))
# 3-gram features to data frame
dtm3 <- convert(dfm3, to = "tm")
data_dtm3 <- as.data.frame(as.matrix(dtm3))
head(data_dtm3)
curious_p_would | p_would_can | would_can_schedul | can_schedul_test | schedul_test_drive | prob_good_though | good_though_give | though_give_we | give_we_time | we_time_replac | ⋯ | technic_will_not | will_not_quit | not_quit_mph | quit_mph_charg | mph_charg_x | charg_x_especi | x_especi_much | especi_much_bigger | much_bigger_hell | bigger_hell_ride | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
text1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Merging datasets of ngrams
tesla_sentiment = read.csv('/kaggle/input/pre-finance/pre_finance.csv')
ngrams_data = cbind(tesla_sentiment, data_dtm2, data_dtm3)
head(ngrams_data, 2)
X | date | tweet | X.can.. | X.test.. | X... | X.almost.. | X.give.. | X.go.. | X.good.. | ⋯ | technic_will_not | will_not_quit | not_quit_mph | quit_mph_charg | mph_charg_x | charg_x_especi | x_especi_much | especi_much_bigger | much_bigger_hell | bigger_hell_ride | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
text1 | 1 | 2015-01-30 | If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text2 | 2 | 2015-02-08 | Prob good though. Will give us time to replace 1st stage video transmitter (not needed for launch, but nice to have). . Air Force tracking radar went down. Launch postponed to same time tomorrow. . Rocket reentry will be much tougher this time around due to deep space mission. Almost 2X force and 4X heat. Plenty of hydraulic fluid tho. . Launching our 1st deep space mission today. Headed to Earth-Sun L1 gravity null point at 1M miles, 4X further than moon. | 0 | 0 | 4 | 1 | 1 | 1 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text3 | 3 | 2015-02-10 | Launch postponed to tomorrow due to high winds at the Cape, but Dragon still inbound from orbit in 90 mins . Extreme wind shear over Cape Canaveral. Feels like a sledgehammer when supersonic in the vertical. Hoping it changes … . "What Are The Civilian Applications?" https://m.youtube.com/watch?v=M8YjvHYbZ9w … | 0 | 0 | 2 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text4 | 4 | 2015-02-11 | Rocket soft landed in the ocean within 10m of target & nicely vertical! High probability of good droneship landing in non-stormy weather. . Primary mission on target. Spacecraft head towards the sun! All good there. . @DanielLockyer We could actually do that...maybe we should . Planning a significant upgrade of the droneship for future missions to handle literally anything. Maybe give it a Merlin for good measure :) . Can't delay any longer. Must proceed with primary mission to launch the Deep Space Climate Observatory spacecraft. . Mega storm preventing droneship from remaining on station, so rocket will try to land on water. Survival probability <1%. . Coming home pic.twitter.com/FmrmYs6R6V . Dragon splashdown off the California coast pic.twitter.com/4Bvfmei8I3 | 0 | 0 | 2 | 0 | 1 | 0 | 3 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text5 | 5 | 2015-02-12 | Landing on a stormy sea pic.twitter.com/7EY25g3IU5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
merged_df <- merge(ngrams_data, tsla_prices, by = "date", all.y = TRUE)
# Add increase column: 1 if the adjusted close rose from the previous day
merged_df$increase <- c(0, ifelse(diff(merged_df$TSLA.Adjusted) > 0, 1, 0))
# Replace NAs in the feature columns with 0; note the parentheses around
# ncol(merged_df) - 1, since `:` binds tighter than `-` in R
merged_df[, 3:(ncol(merged_df) - 1)][is.na(merged_df[, 3:(ncol(merged_df) - 1)])] <- 0
tail(merged_df, 2)
date | X | tweet | X.can.. | X.test.. | X... | X.almost.. | X.give.. | X.go.. | X.good.. | ⋯ | not_quit_mph | quit_mph_charg | mph_charg_x | charg_x_especi | x_especi_much | especi_much_bigger | much_bigger_hell | bigger_hell_ride | TSLA.Adjusted | increase | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<date> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1373 | 2020-07-13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 99.804 | 0 |
1374 | 2020-07-14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 101.120 | 1 |
write.csv(merged_df, file="final_data.csv", row.names = F)
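A subtlety in column-range expressions like the NA-replacement above: in R, the `:` operator binds tighter than `-`, so `3:ncol(df) - 1` is parsed as `(3:ncol(df)) - 1`, which shifts the whole range down by one, rather than `3:(ncol(df) - 1)`. A quick illustration:

```r
n <- 10
r1 <- 3:n - 1    # parsed as (3:n) - 1, i.e. 2 3 4 5 6 7 8 9
r2 <- 3:(n - 1)  # 3 4 5 6 7 8 9
```

Explicit parentheses make the intended range unambiguous.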
This code uses the fread() function from data.table to import the data. fread is implemented in C, which makes file import much faster. Here we do not use the last dataframe created during the feature-engineering step, because of limits on computational power: the resulting matrix would be roughly 250 GB.
To work around this, we use the dataset produced before the exploratory analysis, which contains the date, the tf-idf matrix, the sentiment analysis, and the Yahoo Finance data.
library(data.table)
merged_df = fread('/kaggle/input/pre-data-visualization/tesla_sentiment.csv')
head(merged_df, 2)
date | tweet | ""can"", | ""test"", | """", | ""almost"", | ""give"", | ""go"", | ""good"", | ""launch"", | ⋯ | ""pretti"", | ""starship"", | word_scores | compound | pos | neu | neg | but_count | TSLA.Adjusted | increase |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<IDate> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | ⋯ | <int> | <int> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | <int> |
2015-01-29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0 | 0 | 13.68000 | 0 |
2015-01-30 | If you are curious about the P85D, you can schedule a test drive here: http://ts.la/dE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | {0, 0, 0, 1.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} | 0.318 | 0.141 | 0.859 | 0 | 0 | 13.57333 | 0 |
This code converts the "date" column in merged_df from a character string in the "YYYY-MM-DD" format to a numeric value using the base R functions as.Date() and as.numeric().
as.Date() parses the character strings into Date objects using the specified format string ("%Y-%m-%d" here) and assigns the result back to the "date" column.
as.numeric() then converts the dates into numeric values representing the number of days since January 1, 1970 (the Unix epoch). The resulting values are stored back into the "date" column of merged_df.
merged_df$date <- as.Date(merged_df$date, format="%Y-%m-%d")
merged_df$date <- as.numeric(merged_df$date)
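The epoch convention can be checked directly in base R:

```r
# one day after the Unix epoch (1970-01-01) should map to 1
d <- as.Date("1970-01-02", format = "%Y-%m-%d")
as.numeric(d)  # 1
```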
This code performs data preprocessing for the time series analysis and classification task.
The first two lines select the "TSLA.Adjusted" column from merged_df and standardize it with the scale() function from the stats package, which makes the values easier to work with in subsequent analyses.
The next two lines select all columns except "date", "TSLA.Adjusted", "increase", "word_scores", and "tweet" and scale them as well, standardizing the features for the classification model.
Finally, the labels variable is created from the "increase" column of merged_df. This column contains binary values indicating whether the stock price increased (1) or not (0) on a given day, and it will serve as the target variable in the classification model.
time_series = merged_df %>% select(TSLA.Adjusted)
time_series = scale(time_series)
nlp_data = merged_df %>% select(-date, -TSLA.Adjusted, -increase, -word_scores, -tweet)
nlp_data = scale(nlp_data)
labels = merged_df$increase
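scale() centers each column to mean 0 and scales it to standard deviation 1, which can be verified on a toy column:

```r
# standardize a single toy column, as done for TSLA.Adjusted above
x <- scale(c(10, 20, 30, 40, 50))
mean(x)  # ~0
sd(x)    # ~1
```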
ncol(time_series)
nrow(time_series)
ncol(nlp_data)
nrow(nlp_data)
This code converts the time series dataframe into a matrix, which is the required input format for the LSTM model.
time_series = as.matrix(time_series)
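array() (used for the reshaping below) fills in column-major order without changing the stored values; a small example shows a matrix viewed as a 3-D array:

```r
m <- matrix(1:6, nrow = 3)        # 3 x 2 matrix
a <- array(m, dim = c(3, 2, 1))   # same values, now a 3-D array
dim(a)        # 3 2 1
a[2, 2, 1]    # equals m[2, 2], i.e. 5
```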
This code trains a deep learning model using Keras to make predictions based on two types of data: time series and NLP (natural language processing) data. The model architecture includes two LSTM layers and a dense layer, and it is trained using binary cross-entropy loss and the Adam optimizer. The model is also evaluated using accuracy as a metric.
The steps in the model are:
1. Determine the number of time steps and features in the time series data, and the number of words and dimensions in the NLP data. Set the number of LSTM units and dense units for the model.
2. Reshape the time series and NLP data to match the input shape of the model, using the array() function.
3. Define the input shapes for the time series and NLP data.
4. Define the model architecture using layer_input(), layer_lstm(), layer_global_max_pooling_1d(), layer_dense(), and layer_concatenate(). This creates a model with two inputs (one for time series data and one for NLP data), two LSTM branches, a dense layer, and a concatenation layer that combines the branch outputs.
5. Compile the model with compile(), which sets the loss function, optimizer, and evaluation metric.
6. Define a callback that saves the best model based on validation accuracy, using callback_model_checkpoint().
7. Train the model with fit(), specifying the number of epochs, batch size, validation split, and verbosity, along with the early-stopping and model-checkpoint callbacks.
8. Evaluate the model with evaluate(), which computes the loss and accuracy on the data.
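The binary cross-entropy loss used below can be computed by hand for a single prediction, which helps when reading the training logs; this standalone helper mirrors the textbook definition:

```r
# binary cross-entropy for one true label y (0/1) and predicted probability p
bce <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))

bce(1, 0.9)  # confident and correct: small loss (~0.105)
bce(1, 0.1)  # confident but wrong: large loss (~2.303)
```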
library(keras)
max_time_steps <- nrow(time_series)
num_features <- ncol(time_series)
max_words <- nrow(nlp_data)
embedding_dim <- ncol(nlp_data)
lstm_units <- 32
dense_units <- 64
# Reshape time series data to match the input shape of the model
time_series <- array(data = time_series, dim = c(nrow(time_series), max_time_steps, num_features))
# Reshape NLP data to match the input shape of the model
nlp_data <- array(data = nlp_data, dim = c(nrow(nlp_data), max_words, embedding_dim))
# Define input shapes for time series and NLP data
ts_input_shape <- c(max_time_steps, num_features)
nlp_input_shape <- c(max_words, embedding_dim)
# Define LSTM-based model architecture
ts_input <- layer_input(shape = ts_input_shape, name = "ts_input")
ts_lstm <- layer_lstm(units = lstm_units, return_sequences = TRUE)(ts_input)
ts_output <- layer_global_max_pooling_1d()(ts_lstm)
nlp_input <- layer_input(shape = nlp_input_shape, name = "nlp_input")
nlp_lstm <- layer_lstm(units = lstm_units)(nlp_input)
nlp_output <- layer_dense(units = dense_units, activation = "relu")(nlp_lstm)
combined <- layer_concatenate(inputs = list(ts_output, nlp_output), name = "combined")
output <- layer_dense(units = 1, activation = "sigmoid")(combined)
model <- keras_model(inputs = list(ts_input, nlp_input), outputs = output)
# Compile the model
model %>% compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(),
metrics = c("accuracy")
)
# Define a callback to save the model with the best accuracy
best_model_callback <- callback_model_checkpoint(
filepath = "best_model.h5",
save_best_only = TRUE,
monitor = "val_accuracy",
mode = "max"
)
# Train the model with early stopping and the best model callback
history <- model %>% fit(
x = list(time_series, nlp_data),
y = labels,
epochs = 2000,
batch_size = 32,
validation_split = 0.2,
verbose = 2,
callbacks = list(
callback_early_stopping(patience = 100),
best_model_callback
)
)
# Evaluate the model
scores <- model %>% evaluate(
x = list(time_series, nlp_data),
y = labels,
batch_size = 32
)
# Print the evaluation metrics
cat("Test loss:", scores[[1]], "\n")
cat("Test accuracy:", scores[[2]], "\n")
Test loss: 0.8049979 Test accuracy: 0.7503639
This code is splitting a merged dataset into training and testing sets, and creating corresponding labels for each set.
The split_idx variable is defined as 80% of the number of rows in the merged_df dataframe, and the dplyr library is loaded for later use.
The time_series_train variable is created by selecting the first split_idx rows of the time_series matrix, and converting it to a matrix using the as.matrix() function. The time_series_test variable is created by selecting the remaining rows of the time_series matrix, and also converting it to a matrix using the as.matrix() function. Similarly, nlp_data_train and nlp_data_test are created by selecting the corresponding rows from nlp_data.
The train_labels variable is created by selecting the increase column from the first split_idx rows of the merged_df dataframe. Similarly, test_labels are created by selecting the increase column from the remaining rows of the merged_df dataframe.
split_idx <- round(nrow(merged_df) * 0.8)
library(dplyr)
time_series_train <- time_series[1:split_idx, ]
time_series_train <- as.matrix(time_series_train)
time_series_test <- time_series[(split_idx + 1):nrow(time_series), ]
time_series_test <- as.matrix(time_series_test)
nlp_data_train <- nlp_data[1:split_idx, ]
nlp_data_test <- nlp_data[(split_idx + 1):nrow(nlp_data), ]
train_labels <- merged_df[1:split_idx, ]$increase
test_labels <- merged_df[(split_idx + 1):nrow(merged_df), ]$increase
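The index-based split preserves chronological order (unlike random sampling), which matters for time series data; the same pattern on a toy vector:

```r
# chronological 80/20 split on a toy series
x <- 1:10
split_idx <- round(length(x) * 0.8)
train <- x[1:split_idx]               # everything before the cut
test  <- x[(split_idx + 1):length(x)] # only later observations
train  # 1 2 3 4 5 6 7 8
test   # 9 10
```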
This code defines an LSTM-based neural network model for the combined time series and NLP (natural language processing) dataset. Here is a brief summary of the steps:
1. Define the input shapes, model architecture, and hyperparameters: LSTM units, dense units, dropout rate, and learning rate.
2. Reshape the input data to match the expected input shape of the model.
3. Define the Nadam optimizer with a specific learning rate, and early stopping based on validation accuracy.
4. Compile the model with binary cross-entropy loss, the Nadam optimizer, and an accuracy metric.
5. Train the model with callbacks that save the best model and stop training if validation accuracy stops improving, using a batch size of 32 for up to 2000 epochs.
The architecture consists of two LSTM branches, one for the time series data and one for the NLP data, followed by dropout layers to prevent overfitting, dense layers with a rectified linear unit (ReLU) activation function, and a concatenation layer that merges the two branch outputs. A final dense layer with sigmoid activation produces the binary classification output.
library(keras)
max_time_steps <- ncol(time_series_train)
num_features <- nrow(time_series_train)
max_words <- ncol(nlp_data_train)
embedding_dim <- nrow(nlp_data_train)
lstm_units <- 32
dense_units <- 64
dropout_rate <- 0.1
learning_rate <- 0.01
# Reshape time series data to match the input shape of the model
time_series_train <- array(data = time_series_train, dim = c(nrow(time_series_train), max_time_steps, num_features))
time_series_test <- array(data = time_series_test, dim = c(nrow(time_series_test), max_time_steps, num_features))
# Reshape NLP data to match the input shape of the model
nlp_data_train <- array(data = nlp_data_train, dim = c(nrow(nlp_data_train), max_words, embedding_dim))
nlp_data_test <- array(data = nlp_data_test, dim = c(nrow(nlp_data_test), max_words, embedding_dim))
# Define input shapes for time series and NLP data
ts_input_shape <- c(max_time_steps, num_features)
nlp_input_shape <- c(max_words, embedding_dim)
# Define LSTM-based model architecture
ts_input <- layer_input(shape = ts_input_shape, name = "ts_input")
ts_lstm <- layer_lstm(units = lstm_units, return_sequences = TRUE)(ts_input)
ts_dropout <- layer_dropout(rate = dropout_rate)(ts_lstm)
ts_output <- layer_global_max_pooling_1d()(ts_dropout)
nlp_input <- layer_input(shape = nlp_input_shape, name = "nlp_input")
nlp_lstm <- layer_lstm(units = lstm_units)(nlp_input)
nlp_dropout <- layer_dropout(rate = dropout_rate)(nlp_lstm)
nlp_output <- layer_dense(units = dense_units, activation = "relu")(nlp_dropout)
combined <- layer_concatenate(inputs = list(ts_output, nlp_output), name = "combined")
output <- layer_dense(units = 1, activation = "sigmoid")(combined)
model <- keras_model(inputs = list(ts_input, nlp_input), outputs = output)
# Define the Nadam optimizer with a specific learning rate
optimizer <- optimizer_nadam(learning_rate = learning_rate)
# Define early stopping based on validation accuracy
early_stopping <- callback_early_stopping(
patience = 100,
monitor = "val_accuracy",
mode = "max"
)
# Compile the model
model %>% compile(
loss = "binary_crossentropy",
optimizer = optimizer,
metrics = c("accuracy")
)
# Train the model
history <- model %>% fit(
x = list(time_series_train, nlp_data_train),
y = train_labels,
epochs = 2000,
batch_size = 32,
validation_data = list(list(time_series_test, nlp_data_test), test_labels),
callbacks = list(
callback_model_checkpoint("best_model_advanced.h5", save_best_only = TRUE, verbose = 1),
early_stopping
)
)
# Save the best model as h5 file
save_model_hdf5(model, "best_model_advanced.h5")
Move the model from the input directory to the output directory in Kaggle for further reuse.
# Create output directory if it doesn't exist
dir.create("output")
# Copy model file from input directory to output directory
file.copy(from = "/kaggle/input/model4/best_model6.h5",
to = "/kaggle/working/output/best_model6.h5")
This code loads a saved Keras model from an HDF5 file, evaluates the model's performance on the test data, and prints the test loss and accuracy.
First, the load_model_hdf5() function from the Keras package is used to load the saved model from the file "best_model6.h5" into the model object.
Next, the evaluate() function is used to evaluate the model's performance on the test data. The input data and labels are provided as lists (list(time_series_test, nlp_data_test) and test_labels, respectively), and the batch size is set to 32.
Finally, the test loss and accuracy are printed using cat() and paste0() functions. The test loss is obtained from the first element of the test_loss_and_metrics object, and the test accuracy is obtained from the second element.
# Load the best saved model
library('keras')
model <- load_model_hdf5("/kaggle/working/best_model_advanced.h5")
# Evaluate the model on the test data
test_loss_and_metrics <- model %>% evaluate(
x = list(time_series_test, nlp_data_test),
y = test_labels,
batch_size = 32
)
# Print the test loss and accuracy
cat(paste0("Test loss: ", test_loss_and_metrics[1], "\n"))
cat(paste0("Test accuracy: ", test_loss_and_metrics[2], "\n"))
Test loss: 0.684055924415588 Test accuracy: 0.556363642215729
As we can observe from the two accuracies, the first and simpler model performs substantially better than the more complex one: 75% vs. 56% accuracy. The more complex model adds layers, pooling, and dropout. Moreover, it splits the dataset into train and test sets using an index as a slicer, so that the split respects the chronological order required for time series data. The first model lacks these safeguards, and its validation data is sampled randomly from the dataset. This suggests two conclusions: